Skip to content

Fix integer division in failed-service-start notification heuristic#14815

Open
fethij wants to merge 1 commit into
signalapp:mainfrom
fethij:fix-slow-notification-failure-rate-integer-division
Open

Fix integer division in failed-service-start notification heuristic#14815
fethij wants to merge 1 commit into
signalapp:mainfrom
fethij:fix-slow-notification-failure-rate-integer-division

Conversation

@fethij
Copy link
Copy Markdown

@fethij fethij commented May 28, 2026

Problem

SlowNotificationHeuristics.hasRepeatedFailedServiceStarts() decides whether a user is failing to start the FCM service too often by comparing the failure ratio against serviceStartFailurePercentage (a Float, default 0.5). However, the ratio is computed with integer division:

if (failures.size / (failures.size + successes.size) >= failurePercentage) {

Both failures.size and successes.size are Int, so failures.size / (failures.size + successes.size) is integer division. Because failures.size is always <= the total, this evaluates to:

  • 0 whenever there is at least one success (i.e. any mixed ratio), and
  • 1 only when there are zero successes (100% failures).

Comparing that integer result against 0.5f means the branch only ever passes at a 100% failure rate, never at the configured 50% (or any other fractional threshold). In practice the failed-service-start signal of the delayed-notifications heuristic is dead for the realistic case of a user who fails, say, 50–99% of service starts.

The sibling heuristics (isFailingToDrainQueue, hasLongMessageLatency) compare raw counts/latencies, so they are unaffected — this is the only place a fractional rate is computed.

Fix

Promote the numerator to Float so the comparison uses the intended fractional failure rate:

-    if (failures.size / (failures.size + successes.size) >= failurePercentage) {
+    if (failures.size.toFloat() / (failures.size + successes.size) >= failurePercentage) {

Division by zero is not a concern here: an earlier guard returns when (successes.size + failures.size) < minimumEventCount (default 10).

Notes

One-line change, no behavior change for the 100%-failure case; it now also correctly triggers for the intended fractional thresholds.

hasRepeatedFailedServiceStarts compared the failure ratio against
serviceStartFailurePercentage (a Float, default 0.5), but computed the
ratio with integer division:

    failures.size / (failures.size + successes.size)

Since failures.size is always <= the total, this integer division
evaluates to 0 in every case except when there are zero successes, where
it is 1. As a result the check only ever passes at a 100% failure rate
rather than the configured percentage, so the heuristic effectively never
detects users with a mixed (e.g. 50-99%) FCM service-start failure rate.

Convert the numerator to Float so the comparison uses the intended
fractional failure rate.
@mtang-signal
Copy link
Copy Markdown
Contributor

Thanks, this will go out in 8.14.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants